When the World Wide Web was first conceived as a way to facilitate the sharing of scientific
information at CERN (the European Organization for Nuclear Research), few could have imagined the
role it would come to play in the following decades. Since then, the increasing ubiquity of Internet
access and the frequency with which people interact with it have raised the possibility of using the Web to
better observe, understand, and monitor several aspects of human social behavior. Web sites with
large numbers of frequently returning users are ideal for this task. If these sites belong to companies
or universities, their usage patterns can furnish information about the working habits of entire
populations. In this work, we analyze the properly anonymized logs detailing the access history of
Emory University's Web site. Emory is a medium-sized university located in Atlanta, Georgia. We
find interesting structure in the activity patterns of the domain and study in a systematic way the
main forces behind the dynamics of the traffic. In particular, we find that linear preferential linking,
priority-based queuing and the decay of interest in the contents of the pages are the essential
ingredients to understand the way users navigate the Web.

PACS numbers: 89.75.Hc,89.70.-a 



I. INTRODUCTION 



Access to the Internet has become increasingly popular
during the last decade. However, despite its importance,
much is still unknown about the Web's intrinsic properties,
the way people interact with it, and how it impacts our
culture [1, 2, 3, 4]. Several theoretical approaches have
been proposed in the last few years [5, 6, 7, 8, 9, 10, 11, 12],
but some fundamental issues have yet to be fully understood.
In this work, we will focus on answering the following
question: do any laws govern the way and frequency with
which a person visits a given Web site, or is each individual
intrinsically unique? From a sociological point of view, we
would expect that, although the behavior of a single
individual is ultimately personal and unpredictable, many
inferences can be drawn about the most common behaviors
[2, 13]. A better understanding of the way an individual uses
a given Web site has important economic consequences, as it
can help the developers of the site optimize it in a way that
facilitates its use and monetization. Apart from the
utilitarian point of view, the activity patterns on the sites
also provide important information on the dynamics of a
population. The interaction with electronic devices or
virtual instruments, such as social sites or mobile phones,
opens promising research avenues in this direction
[14, 15, 16, 17, 18, 19, 20].

FIG. 1: Schematic representation of the interactions between
users and Web pages. The system is dynamic; to give a
more visual impression of its variability, dashed lines represent
newly added connections.

The sheer size and diversity of the World-Wide Web
renders the attempts to characterize it on a global scale
hardly feasible. Still, several recent works have focused
on describing the structure of the Web from a statistical
perspective [21, 22]. If, instead of understanding its
structure, the goal is to track how users navigate it, the
challenge becomes even greater. One solution consists in
ignoring the identity of the users, focusing only on the
number of visitors per site and on the number of clicks on
its hyperlinks [8, 23]. Another possibility is to concentrate
on a group of volunteers [24] or on the users of a social
site, who are usually well identified [25, 26, 27]. Our aim
here is to follow the activity of individually trackable Web
surfers in a relatively open environment and characterize
the way in which the interaction between users and Web sites
occurs. This is the reason why we analyze the logs of the Web
server of Emory University. These logs registered the requests
by Internet users, internal or external, of Web pages in
the second level of the Emory domain (www.emory.edu).
The data cover a period from Apr. 1, 2005 to Jan. 17, 2006.
Each time a computer connects to the Internet, it is assigned
a unique IP address that identifies it. When a user requests
a page from a Web site, the IP, the page requested (URL), the
time at which the request occurred and several other details
are registered by the Web server. In our case, to preserve
privacy, the data have been anonymized in a consistent way,
allowing us to follow the behavior of each IP through a single
ID number while masking its real identity. The log structure
is represented schematically in Fig. 1. On the left, we have
the anonymized IP addresses, which connect to the URLs on
the right. To avoid counting different elements of a Web page,
such as photos or logos, as independent pages, we have
restricted our definition of URL to (s)htm(l), cfm, php,
asp(x), jsp and txt documents. Each line of the logs
corresponds to a different connection, timestamped with the
date and time at which it took place. During our observation
period, the domain received over 3 million visitors to about
2.5 million pages, for a grand total of over 53 million clicks.
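
The preprocessing described above (consistent anonymization of IPs and restriction to page-like URLs) can be sketched as follows. This is only a minimal illustration, not the pipeline actually used for the Emory logs: the record layout `(ip, url, timestamp)`, the function names `is_page` and `anonymize`, and the choice of sequential integers as anonymous IDs are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Extensions treated as pages, following the (s)htm(l), cfm, php,
# asp(x), jsp and txt rule stated in the text.
PAGE_EXTENSIONS = {"htm", "html", "shtm", "shtml", "cfm", "php",
                   "asp", "aspx", "jsp", "txt"}


def is_page(url):
    """Return True if the URL path ends in one of the page extensions."""
    last_segment = urlparse(url).path.rsplit("/", 1)[-1]
    if "." not in last_segment:
        return False
    return last_segment.rsplit(".", 1)[-1].lower() in PAGE_EXTENSIONS


def anonymize(records):
    """Map each IP to a stable integer ID and drop non-page requests.

    `records` is an iterable of (ip, url, timestamp) tuples; this layout
    is hypothetical and only serves the illustration.
    """
    ids = {}
    kept = []
    for ip, url, timestamp in records:
        if not is_page(url):
            continue  # images, logos, style sheets, etc.
        user_id = ids.setdefault(ip, len(ids))  # same IP -> same ID
        kept.append((user_id, url, timestamp))
    return kept
```

Because the mapping from IP to ID is kept for the whole run, the same address always receives the same ID, which is what allows single users to be followed without exposing their real addresses.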



II. ACTIVITY PATTERNS OF THE POPULATION

Let us start by taking a view of the collective behavior
of the entire population during the time period for which
we have data. Intuitively, we expect the activity on a
domain to vary from day to day, week to week and even
month to month. In particular, it should be possible to
observe variations in the activity, measured as the number
of requests, due to weekends, holidays and other major
events that disrupt the normal life of the University. The
traffic at Emory is dominated by students and professors
in the course of their professional activities, and hence
the major events of the school year, such as the beginning
and end of a semester, breaks or holidays, should be
noticeable in the Web traffic. To check this idea, the
number of page requests detected per day is shown in Fig. 2
as a function of the observation date. One obvious feature
of the figure is a clear oscillatory behavior with a period
of one week. It also displays different trends for two
special times of the year: one in the latter part of August,
corresponding to the beginning of the school year, and the
other at the end of December, when the semester finishes.

FIG. 2: Total number of clicks registered per day during the
whole period of traffic observation. The gray bands correspond
to the beginning and the end of the semester: from
Aug 16 to Aug 31, and from Dec 16 to Dec 31.

Since accesses to the Emory domain are mostly work related,
traffic can be used as an indirect measure of the
University's "productivity". Busier days result in
larger amounts of traffic, while during holidays and weekends
the number of page requests is smaller overall, which makes
the relative changes in traffic significant. The
averages of page requests by day of the week during the
complete observation period are plotted, together with
their corresponding 95% confidence intervals, in Fig. 3.
Our results support the old adage that after Wednesday
the hardest part of the week is already behind us,
with the activity slowly decreasing from then on toward the
weekend. Sunday is the least active day of the week.




FIG. 3: Comparison between the average week activity and
activity during Thanksgiving week. The green vertical lines
represent the beginning and end of the official Thanksgiving
break at Emory University. The error bars for the average
correspond to two times the standard deviation, $2\sigma$, i.e. the
95% confidence interval.

FIG. 4: Average hourly activity in the complete Emory domain
as a function of the hour of the day. The curves are
averaged over the weekdays (circles), Saturdays (squares) and
Sundays (triangles).



It is also interesting to note the relatively low activity
of Mondays, which are only slightly more active than Saturdays.
Armed with an estimate of how activity evolves over the
week, we are now in a position to evaluate the effects
of a break. In the same figure, we also show the
data for the days surrounding Thanksgiving, one of the
major holidays in the US. Traditionally, the Thanksgiving
recess goes from Thanksgiving Thursday until Sunday, so
one might expect any decrease in activity to be most
noticeable during this period. This is what we observe,
but we find other effects as well. The Mondays before
and after Thanksgiving both seem to be less productive than
normal. This is, however, compensated by busier-than-usual
Tuesdays before and after the break.
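
The weekday averages and their $2\sigma$ error bars discussed above can be computed along the following lines. This is a sketch under stated assumptions: the input is a hypothetical mapping from calendar date to the number of requests on that day, the function name `weekday_profile` is invented for the example, and the sample standard deviation is used for the spread. The three counts at the bottom are made-up numbers, not values from the Emory logs.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, stdev


def weekday_profile(daily_counts):
    """Average requests per weekday with a 2-sigma band.

    `daily_counts` maps datetime.date -> number of requests that day.
    Returns {weekday: (mean, 2 * sample standard deviation)}, Monday = 0.
    """
    by_weekday = defaultdict(list)
    for day, count in daily_counts.items():
        by_weekday[day.weekday()].append(count)
    profile = {}
    for weekday, counts in by_weekday.items():
        # A single observation has no spread; report 0.0 in that case.
        spread = 2 * stdev(counts) if len(counts) > 1 else 0.0
        profile[weekday] = (mean(counts), spread)
    return profile


# Illustrative, made-up counts: two Mondays and one Sunday in April 2005.
example = {date(2005, 4, 4): 100, date(2005, 4, 11): 120, date(2005, 4, 3): 10}
profile = weekday_profile(example)
```

A holiday week can then be compared with this baseline by checking, day by day, whether its count falls outside the band `mean ± spread` of the corresponding weekday.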



Intraday variations, with some times of the day being
busier than others, are also seen. By averaging the
activity observed at a given hour over all the weekdays in
our data set, we obtain Fig. 4. The most active period
is between 7AM and 6PM. The large dip between 11AM
and 2PM is due to the lunch break. After lunch, the
activity peaks, reaching the highest level of the day. After
6PM, activity levels off until 10PM, marking the end of
the workday. Saturdays do not differ significantly from
other days of the week; only Sundays display a different
activity profile. Similar patterns for human circadian 
rhythms have been recently reported for other systems 
in Refs. [TSJ [T7J [18] . Such ubiquity indicates important 
universal features (profiles) regarding human habits that 
Web analytics can help to characterize in a quantitative 
way. 



III. INDIVIDUAL ACTIVITIES 

Although interesting, the analysis of averages taken
over the entire population has limitations. The histograms
of single-user activity are typically very wide,
being in some cases well modeled by power-law distributions
with exponents smaller than 2 [23]. When this
happens, it is difficult to identify a "typical" user based
on such metrics: while most users visit the domain
sites only a few times, a significant fraction of individuals
(as identified by their IP addresses) accumulate large
numbers of page requests. This variability deserves greater
attention since it can carry important information.
Figure 5 shows the activity patterns of three users. We do
not know the actual IPs, but it is possible to deduce the
intention of the visit based on the particular URL accessed
and on the profile of the activity. In Figs. 5a) and 5b), the
users are computer programs. The case shown in a)
corresponds to a malicious attack on a finance service
Web page of Emory, which took place on April 4th. The
profile of the number of access attempts per unit of time
displays a very peculiar shape, quite regular as occurs for
most automatic navigators, with a very high number of
requests concentrated in a short period of time. Other,
more friendly, robots are those corresponding to updating
programs. An example can be seen in Fig. 5b), where a
software site at Emory is visited regularly, presumably in
search of new updates. Finally, human users show a very
different activity profile from that of the machines. The
activity of a human user selected at random can be seen
in Fig. 5c). In this case, the URL is an administrative
site that demands manual entry of data. The activity
clusters in a few days, followed by relatively long
periods of time without any request.

FIG. 5: Activity history of several individuals: a) what seems
to be a malicious attack on a finance Web page of the
University, b) an automatic software update program, and c) a
human user entering data in an administration site. The red
curves represent the cumulative number of clicks. To facilitate
the visualization, the scales of the cumulative and temporal
numbers of clicks are different. The axis on the right side
of each plot displays the scale for the cumulative number of
clicks.

Given the strong variability in the activity of human
users, it is interesting to measure some statistics about
it. In Figure 6, we have represented the histograms of the
duration of the periods between requests for two different
scenarios: in Fig. 6a), for the time between consecutive
visits of the same user to the same URL, $P(\tau_v)$, and, in
Fig. 6b), for the time between clicks by the same user to
any of the sites in Emory's domain (not necessarily to the
same URL), $P(\tau_c)$. Both distributions are rather wide.
The distribution $P(\tau_c)$ can be well fitted by a power-law
decaying function of the type $P(\tau_c) \sim \tau_c^{-1.25}$. The
distribution of time between consecutive visits, $P(\tau_v)$,
decays even more slowly, with an exponent of value $-1$.
This latter value can be understood thanks to a model 
on human dynamics recently proposed by A.-L. Barabasi 
[28] (see also [TTJ EH El 023 [3T] ) . In this model, an agent 
has to perform a set of tasks, each with a randomly assigned
priority. A step consists in the selection of the task with
the highest priority with probability $p$, or of a random
one with probability $1 - p$. After the execution, a new
task occupies the free spot in the queue. This group
of rules is extremely simple but is able to reproduce a
distribution of waiting times for the tasks in the queue
that, in the limit $p \to 1$, decays as $\sim 1/\tau$. It can be
argued that consecutive visits to the same site in Emory
are equivalent to one of these tasks, since many of the
visits are related to work or studies and probably carry
an inherent sense of priority for each user. Also, returning
immediately to the same URL and reloading it is not
a common practice, at least not among humans. It is
important to note that if the user pushes the back button
in the browser, we are typically not able to detect such a
move, because browser caching means it leaves no trace in
the server logs. If each entrance is seen as
a fresh start of a different task, the parallelism between
the rules of the model and the way users return to the
same pages can be justified.

FIG. 6: Distribution of times between consecutive clicks: a)
visits of the same user to the same URL, and b) the same
user to any page of the Emory domain. The straight lines
correspond to the power-laws $f(\tau) \sim \tau^{-1}$ in a) and
$f(\tau) \sim \tau^{-1.25}$ in b). In the inset of b), the distribution of time in
the queue is plotted for a variation of Barabasi's model [28]
(see text) with a number of executed tasks per unit of time
of $\nu = 3$, with probability of choosing a task according to
priority $p = 0.99999$, a total of 100 tasks and $10^7$ time steps.
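
The two inter-event times used in this section, $\tau_v$ (same user, same URL) and $\tau_c$ (same user, any URL), can be extracted from the click records roughly as shown below. The sketch assumes a hypothetical record layout `(user, url, time)` with numeric times; the function names and the logarithmic binning with base 2 are illustrative choices, not necessarily those used to produce Fig. 6.

```python
import math
from collections import defaultdict


def inter_event_times(clicks):
    """Return (tau_v, tau_c) from an iterable of (user, url, time) records.

    tau_v: gaps between consecutive visits of the same user to the same URL.
    tau_c: gaps between consecutive clicks of the same user on any URL.
    """
    last_by_pair = {}
    last_by_user = {}
    tau_v, tau_c = [], []
    for user, url, t in sorted(clicks, key=lambda record: record[2]):
        if (user, url) in last_by_pair:
            tau_v.append(t - last_by_pair[(user, url)])
        if user in last_by_user:
            tau_c.append(t - last_by_user[user])
        last_by_pair[(user, url)] = t
        last_by_user[user] = t
    return tau_v, tau_c


def log_binned_density(values, base=2.0):
    """Histogram positive values in bins [base**k, base**(k+1)).

    Counts are divided by the bin width and by the number of values, so a
    power law shows up as a straight line on a log-log plot.
    Returns a list of (bin lower edge, density) pairs.
    """
    positive = [v for v in values if v > 0]
    counts = defaultdict(int)
    for v in positive:
        counts[math.floor(math.log(v, base))] += 1
    density = []
    for k in sorted(counts):
        low, high = base ** k, base ** (k + 1)
        density.append((low, counts[k] / (len(positive) * (high - low))))
    return density
```

Logarithmic bins are a common choice for wide distributions because linear bins leave the tail almost empty; the exponent can then be estimated from the slope of the binned density in log-log scale.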

The question is then whether there is a way to understand
also the exponent $-1.25$ of $P(\tau_c)$. The answer is
yes, if one considers that a single click on the domain
does not necessarily correspond to the completion
of a task. Many tasks will require a (fast) sequence of
clicks on different sites of the domain for their completion.
This is why we propose the following modification
of the model: at each time step, instead of a single task, a
group of $\nu$ tasks is selected for execution. The selection
of each of them is done as before: by priority with
probability $p$, and at random otherwise. We have performed a
systematic numerical study of this model and found that,
provided that $\nu \geq 2$, the distribution of the time of
permanence in the queue always decays with the same exponent.
An example with $\nu = 3$ is shown in the inset of Fig. 6b).
These two models are oversimplifications
but seem able to capture some of the essential features
present in the dynamics of a large community of users,
leading to the existence of universal exponents.
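
A minimal simulation of the priority-queue model and of its multi-task variant is sketched below. Several details are assumptions, since the text does not fix them: priorities are drawn uniformly in [0, 1), the tasks executed in one step are distinct, all of them are replaced by fresh tasks at the end of the step, and the waiting time is counted in whole steps. The function name `simulate_queue` and its parameters are invented for the example; `nu=1` corresponds to the original rule of one task per step.

```python
import random


def simulate_queue(n_steps, queue_len=100, nu=1, p=0.99999, seed=0):
    """Waiting times in a fixed-length priority queue.

    At every step `nu` distinct tasks are executed. Each one is picked by
    highest priority with probability `p`, or uniformly at random among
    the tasks not yet picked in this step otherwise. Executed tasks are
    replaced by new ones with fresh random priorities. Requires
    nu <= queue_len. Returns the list of waiting times in steps.
    """
    rng = random.Random(seed)
    priority = [rng.random() for _ in range(queue_len)]
    arrival = [0] * queue_len  # step at which each task entered the queue
    waits = []
    for t in range(1, n_steps + 1):
        candidates = list(range(queue_len))
        chosen = []
        for _ in range(nu):
            if rng.random() < p:
                i = max(candidates, key=priority.__getitem__)
            else:
                i = rng.choice(candidates)
            candidates.remove(i)
            chosen.append(i)
        for i in chosen:
            waits.append(t - arrival[i])
            priority[i] = rng.random()  # a new task takes the free spot
            arrival[i] = t
    return waits
```

The caption of Fig. 6 quotes 100 tasks, $\nu = 3$, $p = 0.99999$ and $10^7$ time steps; a pure-Python loop of that length is slow, so shorter runs are more practical for a first look at the shape of the waiting-time histogram.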
FIG. 7: a) Average variation in a single day of the number of
different visited sites, $\langle \Delta k_{IP} \rangle$, as a function of the number of
sites already seen during the previous week, $k_{IP}$. b) The same
type of function but for the number of visitors to a URL,
$\langle \Delta k_{URL} \rangle$. c) The average daily variation of the number
of clicks on each IP-URL connection as a function of the
clicks accumulated during the previous week, $\langle \Delta w \rangle (w)$. The
insets display the cumulative distributions for each quantity:
the black curves are obtained by splitting the database into
one-week periods and averaging over all of them, while the red ones
are the distributions for the full 292-day period.



IV. ATTRACTIVENESS AND PREFERENTIAL LINKING



Another aspect worth exploring in the dynamics
of our database is whether the new connections or
new clicks follow a preferential rule. Preferential linking,
or the "rich get richer" effect, is a relatively old concept,
originally considered in a socio-economic context by E.H.
Simon [32]. In the area of graph theory, it was introduced
in 1999 [7] with a model inspired by the hyperlinks
of the Web (see also [33, 34]). A few years have passed
and, although several attempts have been made to check
the existence of preferential linking in a variety of
systems [26, 35, 36, 37], as far as we know a systematic
study of preferentiality in the user-Web relationship is
still missing. To be precise, if the variable under
consideration $x$ can change in time for each element of the
system, it is said to show linear preferentiality if the
variation follows on average an expression of the type



$$\langle \Delta x \rangle \approx A x + B, \qquad (1)$$



where the average $\langle \cdot \rangle$ is taken over all elements $i$ of the
system with $x_i = x$, and $A$ and $B$ are constants. This
mechanism supposes that, if the update refers to quantities
such as the number of connections or the number of clicks of
a site, the probability that a particular site is chosen for
update is proportional to the number of connections or
clicks that it has previously accumulated. More popular
sites thus concentrate more attention, leading to an
agglomeration process that, after a while, produces a very
wide distribution of values of $x$. If the relation is linear,
as in Eq. (1), the distribution $P(x)$ can be approximated
by a decaying power-law function with an exponent
depending on the values of $A$ and $B$ [4]. If it is not linear,
two simple scenarios can occur. Either $\langle \Delta x \rangle$ grows with
$x$ faster than linearly and the most popular element will
eventually gather a finite fraction of all the available
value of $x$, or it is sublinear and the distribution of values
of $x$ will not be wide (stretched exponential instead of a
power-law) HEEIEE]. 

In our case, the "elements" of the system are Web
pages and IPs, and the quantity $x$ can be, among other
things, the number of clicks of a certain user on a given
URL, which we call $w$, the number of different users that
a URL receives, $k_{URL}$, or the number of different sites
that an IP visits, $k_{IP}$. We have also performed a similar
study for the activity of the URLs and IPs (defined as
the number of requests received or sent), but the results
are similar. We will therefore focus our attention only on
$k_{URL}$, $k_{IP}$ and $w$. The variation of each of these variables,
$\Delta x$, in a single day is measured after having accumulated
the values of $x$ for a full week. Then an average is taken
over all the weeks of the database. The results displaying
$\langle \Delta x \rangle$ as a function of $x$ are depicted in Fig. 7. The
variation of $k_{IP}$, $k_{URL}$ and $w$ can be well approximated by
linear preferential functions similar to Eq. (1) (straight
lines in the main plots). This means that the rate at
which users explore the Web ($\Delta k_{IP}$), the rate at which
popular pages attract new users ($\Delta k_{URL}$) and the rate
at which users revisit Web pages ($\Delta w$) depend linearly
on the previous week's performance. It should also imply
that the distributions $P(k_{IP})$, $P(k_{URL})$ and $P(w)$ are
wide and well fitted by a power-law. In order to check
this last point, we have measured the cumulative
distributions $C(x) = \int_x^{\infty} dy \, P(y)$ for the three quantities. The
cumulative distribution $C(x)$ is the probability of having
a value of the variable greater than $x$ and usually
exhibits better statistics than $P(x)$. Note that if $P(x)$
goes as $P(x) \sim x^{-\gamma}$, then $C(x) \sim x^{1-\gamma}$. The results are
shown in the insets of Fig. 7. In these plots, we have
also included the cumulative distributions estimated by
aggregating the values of $k_{URL}$, $k_{IP}$ and $w$ for the whole
period of the database (292 days). The comparison of
the cumulative distributions obtained for the two time
windows holds an important surprise. For $C(k_{IP})$,
the two curves overlap and can be fitted with a power-law
of exponent $\gamma \approx 2.2$. However, this is not true for
the popularity of the URLs, $k_{URL}$, or for $w$. This
difference in the output depending on the extension of the
time window has important consequences for modeling
the dynamics of the system. Its origin is related to the
fact that, in a university, the time during which a site, or
more specifically its content, is relevant closely tracks the
evolution of the academic year. In general, a similar rule
should apply to all Web sites. The lifetime can be
more flexible, depending also on the number of visitors,
but a certain loss of interest as time passes since the
first online publication can be expected [12]. After this
time, the page does not attract new users or visits from
the old ones at the same rate (if it attracts any at all).
This breaks one of the implicit assumptions of preferential
linking: new elements are added at a constant rate,
while the old ones keep attracting attention indefinitely.
It also implies that linear preferential linking is not valid
for longer time windows for $k_{URL}$ and $w$, and that their
distributions cannot be modeled as simple (stable over
time windows) power-laws.
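
The measurement of the one-day variation as a function of the value accumulated over the previous week can be sketched for the click weight $w$ as follows. The record layout `(user, url, day)` with integer day indices, the function names, and the plain least-squares fit are assumptions made for illustration; the text does not specify how the straight lines in Fig. 7 were fitted.

```python
from collections import Counter, defaultdict


def mean_variation(records, week_start):
    """Average one-day variation of w as a function of w for one window.

    `records` are (user, url, day) tuples with integer day indices.
    w counts the clicks on a user-URL pair during the days
    [week_start, week_start + 7); the variation is the number of clicks
    on the same pair on day week_start + 7.
    Returns {w: average variation over all pairs with that w}.
    """
    w = Counter()
    dw = Counter()
    for user, url, day in records:
        if week_start <= day < week_start + 7:
            w[(user, url)] += 1
        elif day == week_start + 7:
            dw[(user, url)] += 1
    grouped = defaultdict(list)
    for pair, clicks in w.items():
        grouped[clicks].append(dw[pair])  # 0 if the pair got no new click
    return {x: sum(v) / len(v) for x, v in grouped.items()}


def linear_fit(points):
    """Ordinary least-squares fit y = A * x + B over a {x: y} mapping."""
    xs, ys = list(points), list(points.values())
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x
```

Repeating `mean_variation` for every week of the database and averaging the results for each value of $w$ gives a curve that can be compared with Eq. (1). The same scheme carries over to $k_{IP}$ and $k_{URL}$ if sets of distinct sites or users replace the click counters.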

To visualize the life story of URLs, we represent in
Figure 8 the number of pages first seen or last seen in the
system as a function of time. We will say that a certain
URL U is first seen at time t if it receives its first request
at t. Complementarily, the time at which U is last seen,
disappearing from the database, is when it receives the
last registered visit. Note that, although similar in look,
this plot is different from Fig. 2, where we plot
the activity measured as the total number of clicks on the
Emory domain as a function of time. Two large peaks can
be seen in Fig. 8. The times of these peaks coincide with
the end and beginning of the semester. Many Web pages
thus seem to have a relatively short life, probably being
set up by professors or students who abandon them at
the end of the semester. In many cases, even the http
addresses are no longer maintained.

FIG. 8: The number of URLs that are first (last) seen as a
function of time. The two major "extinction" and "creation"
events correspond to the beginning and end of the semester
and closely match the peaks detected in Fig. 2.
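
The first-seen and last-seen counts defined above can be obtained with a single pass over the records. The sketch below assumes the same hypothetical `(user, url, day)` layout with integer day indices; the function name is illustrative.

```python
from collections import Counter


def first_last_seen(records):
    """Count the URLs first seen and last seen on each day.

    `records` are (user, url, day) tuples (assumed layout).
    Returns two Counters mapping day -> number of URLs.
    """
    first, last = {}, {}
    for _, url, day in records:
        first[url] = min(day, first.get(url, day))
        last[url] = max(day, last.get(url, day))
    return Counter(first.values()), Counter(last.values())
```

Plotting both counters against the day index gives the "creation" and "extinction" curves described in the text.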



V. DISCUSSION AND CONCLUSIONS 

Web server logs have proven to be an important source
of information regarding human dynamics. Here we have
offered an extensive study of the medium-sized Web domain
of Emory University, tracking the users in a consistent
way for 292 days. A clear signal of human circadian
rhythms has been obtained, as well as activity patterns
that seem to be universal, since they are in agreement
with previous results on mobile phone records or email
posting in social sites. In addition, in this case, the online
traffic can be related to the productivity of the members
of the University, namely, students, professors and staff.
The comparison between the activity of an ideal average
week and the week containing Thanksgiving is revealing
in this sense, with some days concentrating a high
level of traffic, much higher than the average, and others
falling clearly behind.

After the characterization of activity at the scale of the
whole University, we have moved our focus down to the study
of statistics of single users. The difference in the navigation
patterns between humans and automatic processes,
either malicious or friendly, has been highlighted. Humans
are in general more unpredictable, although a similar
behavior might be reproduced by sophisticated automatic
means. In particular for human users, it is important
to analyze the statistics of the times between events
(clicks) and compare them with recently introduced models
based on priority queues. We have shown that such models
are indeed able to explain the distribution of inter-click
times if the user-site dyad is considered. Furthermore,
a simple modification, in which the number of tasks
to execute in a short interval of time is higher than one,
can also account for the statistics of times between
requests of the same user on the whole Emory domain.

Finally, we have explored another mechanism that has
been proposed as an important ingredient in the development
of the WWW, namely "preferential attachment".
Linear preferential attractiveness is detected in all the
aspects of the traffic considered: the rate of exploration
of new sites by the users, the capture of new visitors
by the sites and the new clicks received on each user-Web
page connection. In all these cases, the linear relation
holds over short periods of time. For longer periods, the
lifetime of the Web pages must be taken into account,
substantially complicating the scenario. Preferential linking,
priority queuing and Web page aging thus seem to be
essential factors for any model aimed at characterizing Web
surfing.